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Abstract 

We study linear models under heavy-tailed priors from 
a probabilistic viewpoint. Instead of computing a single 
sparse most probable (MAP) solution as in standard deter- 
ministic approaches, the focus in the Bayesian compressed 
sensing framework shifts towards capturing the full poste- 
rior distribution on the latent variables, which allows quan- 
tifying the estimation uncertainty and learning model pa- 
rameters using maximum likelihood. The exact posterior 
distribution under the sparse linear model is intractable 
and we concentrate on variational Bayesian techniques to 
approximate it. Repeatedly computing Gaussian variances 
turns out to be a key requisite and constitutes the main com- 
putational bottleneck in applying variational techniques in 
large-scale problems. We leverage on the recently pro- 
posed Perturb-and-MAP algorithm for drawing exact sam- 
ples from Gaussian Markov random fields (GMRF). The 
main technical contribution of our paper is to show that es- 
timating Gaussian variances using a relatively small num- 
ber of such efficiently drawn random samples is much more 
effective than alternative general-purpose variance estima- 
tion techniques. By reducing the problem of variance es- 
timation to standard optimization primitives, the resulting 
variational algorithms are fully scalable and parallelizable, 
allowing Bayesian computations in extremely large-scale 
problems with the same memory and time complexity re- 
quirements as conventional point estimation techniques. We 
illustrate these ideas with experiments in image deblurring. 



1. Introduction 

Sparsity: Deterministic and Bayesian viewpoints Spai- 
sity has proven very fruitful in data analysis. Early methods 
such as total variation (TV) modeling lISTl . wavelet thresh- 
olding 123)1 . sparse coding 1261 . and independent compo- 
nent analysis f6l have had big impact in signal and image 
modeling. Recent research in compressed sensing |i4,,8J has 
shown that high-dimensional signals representable with few 



non-zero coefficients in a linear transform domain are ex- 
actly recoverable from a small number of measurements 
through linear non-adaptive (typically random) operators 
satisfying certain incoherence properties. Signal recovery 
in these deterministic models typically reduces to a convex 
optimization problem and is scalable to problems with mil- 
lions of variables such as those arising in image analysis. 

The deterministic viewpoint on sparsity has certain 
shortcomings. In real-world applications the theoretical as- 
sumptions of compressed sensing are often violated. For 
example, filter responses of natural images exhibit heavy- 
tailed marginal histograms but are seldom exactly zero ll22l . 
In practical applications such as image inpainting or deblur- 
ring the measurement operators are fixed and do not satisfy 
the incoherence properties. In these setups it is impossible 
to exactly reconstruct the underlying latent signal and it is 
important to quantify the associated estimation uncertainty. 

Along these lines, there is a growing number of studies 
both in the machine learning lfTl ll0lfT9ll36l l39l and the signal 
processing literature EO which bring ideas from sparse 
modeling into a powerful Bayesian statistical approach for 
describing signals and images. The most distinctive char- 
acteristic of Bayesian modeling is that beyond finding the 
most probable (MAP) solution it also allows us to represent 
the full posterior distribution on the latent variables, thus 
capturing the uncertainty in the recovery process. From 
a practical standpoint, this Bayesian compressed sensing 
framework allows learning model parameters and devising 
adaptive measurement designs in a principled way. We em- 
ploy kurtotic priors for modeling the heavy-tailed nature of 
filter responses. Beyond sparsity, we can also capture struc- 
tured statistical dependencies between model variables ll38l 
using tools from probabilistic graphical models [29], yield- 
ing methods that more faithfully describe the complex sta- 
tistical properties of real- world signals. 

Variational Bayes for sparse linear models Computing 
the exact posterior under heavy tailed priors is not tractable. 
We thus have to contend ourselves with approximate solu- 
tions, either of stochastic sampling or deterministic varia- 



tional type. In sampling techniques we represent the poste- 
rior using random samples drawn by Markov chain Monte- 
Carlo (MCMC); see |29 30 32, 33| for recent related work. 

The variational techniques in which we focus in this pa- 
per approximate the true posterior distribution with a pa- 
rameterized Gaussian which allows closed-form computa- 
tions. Inference amounts to adjusting the variational param- 
eters to make the fit as tight as possible 141]. Mostly related 
to our work are [1,10,19,36]. There exist multiple alterna- 
tive criteria to quantify the fit quality, giving rise to approx- 
imations such as variational bounding fTSl, mean field or 
ensemble learning, and, expectation propagation (EP) Il24l 
(see ||2ll28l for discussions about the relations among them), 
as well as different iterative algorithms for optimizing each 
specific criterion. These variational criteria involve some 
sort of integration over the latent variables. We should con- 
trast this with the Laplace approximation |2| which is based 
on a second-order Taylor expansion around the MAP point 
estimate and is thus inappropriate for the often non-smooth 
posterior density under the sparse linear model. 

All variational algorithms we study in the paper are of 
a double-loop nature, requiring Gaussian variance estima- 
tion in the outer loop and sparse point estimation in the in- 
ner loop |35, 36, 40|. The ubiquity of the Gaussian vari- 
ance computation routine is not coincidental. Variational 
approximations try to capture uncertainty in the intractable 
posterior distribution along the directions of sparsity. These 
are naturally encoded in the covariance matrix of the proxy 
Gaussian variational approximation. Marginal Gaussian 
variance computation is also required in automatic rele- 
vance determination algorithms for sparse Bayesian learn- 
ing f20| and relevance vector machine training f39l; the 
methods we develop could also be applied in that context. 

Variance computation: Lanczos vs. proposed Monte- 
Carlo algoritlim Estimating Gaussian variances is cur- 
rently the main computational bottleneck and hinders the 
wider adoption of variational Bayesian techniques in large- 
scale problems with thousands or millions of variables such 
as those arising in image analysis, in which explicitly stor- 
ing or manipulating the full covariance matrix is in general 
infeasible. Computing variances in Gaussian Markov ran- 
dom fields (GMRFs) with loops is challenging and a host of 
sophisticated techniques have been developed for this pur- 
pose, which often only apply to restricted classes of mod- 
els II2T.42I. A general-purpose variance computation tech- 
nique 12711341 is based on the Lanczos iterative method for 
solving eigenproblems fTT| and has been extensively stud- 
ied in the variational Bayes context by Seeger and Nick- 
isch 13611371 . Unless run for a prohibitively large number of 
iterations, the Lanczos algorithm severely underestimates 
the required variances, to the extent that Lanczos is inad- 
equate for optimizing criteria like expectation propagation 



which are sensitive to gross variance estimation errors 113511 . 

The main technical contribution of our work is to demon- 
strate that the sample-based Monte-Carlo Gaussian vari- 
ance estimator of |f30ll performs markedly better than the 
Lanczos algorithm as the key computational sub-routine in 
the variational learning context. Our estimator builds on 
the efficient Perturb-and-MAP sampling algorithm of ll30l 
(c.f. 1291 [32I ) which draws exact GMRF samples by lo- 
cally injecting noise to each Gaussian factor independently, 
followed by computing the mean/mode of the perturbed 
GMRF by preconditioned conjugate gradients. Being un- 
biased, the proposed sample estimator does not suffer from 
the Lanczos systematic underestimation errors. In practice, 
a few samples suffice for capturing the variances with accu- 
racy sufficient for even the more sensitive expectation prop- 
agation algorithm to work reliably. Moreover, correlations 
(i.e. off-diagonal elements of covariance matrix) needed in 
certain applications are easy to compute. 

The advocated approach to Monte-Carlo variance esti- 
mation for variational learning has several other advantages. 
It is fully scalable, only relying on well-studied computa- 
tional primitives, thus allowing Bayesian inference with the 
same memory and time complexity requirements as conven- 
tional point estimation. The proposed algorithm is paral- 
lelizable, since the required Gaussian samples can be drawn 
independently on different processors. Further, we show 
how we can use the samples to estimate the free energy and 
monitor convergence of the algorithm at no extra cost. 

2. Variational Bayes for sparse linear models 

2.1. The sparse linear model: Point estimation vs. 
Bayesian inference 

The formulation of the sparse linear model we consider 
follows the setup of FT0l [36l . We consider a hidden vec- 
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which follows a heavy-tailed prior distribu- 
tion P(x) and noisy linear measurements y G K.^^ of it are 
drawn with Gaussian likelihood P(y|x). Specifically: 

K 

P(x; 0) ex n tkisl^) , P(y|x; 9) = AA(y; Hx, a'l) , 
fe=i 

(1) 
where the K rows of G = [gf ; . . . ; g^] and the M rows of 
H = [h^; . . . ; hjj] are two sets of length- A^ linear filters, 
the former mapping x to the domain s = Gx in which it ex- 
hibits sparse responses and the latter capturing the Gaussian 
measurement process The sparsity inducing potentials 
are denoted by tk{sk)- The Laplacian tk{sk) — e^'^'''*''!, 
Sk — gfe X, is a widely used form for them. In some appli- 
cations a subset of the model's aspects (H, ct^, G) can be 
unknown and dependent on a parameter vector 9; e.g., in 



'A^(x;/x,X;) = |27rS|-i/2exp(-|(x-/x)TS-i(x-/x)) 
the multivariate Gaussian density on x witli mean n and covariance S. 



blind image deconvolution 9 typically is the unknown blur- 
ring kernel k which determines the measurement matrix H. 
By Bayes' rule, the posterior distribution of the latent 
variables x given y has the non-Gaussian density 
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P(x|y) - Z-i(0)P(y|x) W tkisk) , where (2) 
fe=i 

K 

Z{e)^P{y-e) = j P{y\^)\{tu{sk)d^ (3) 



k=l 



is the evidence/ partition function. 

Point estimation corresponding to standard compressed 
sensing amounts to finding the posterior MAP configuration 
xmap — argmax,;. log P(x|y), leading to minimization of 
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fe=i 



logife(sfc). (4) 



Point estimation thus reduces to a standard optimization 
problem and a host of modern techniques have been de- 
veloped for solving it, scalable to large-scale applications. 
However, since it ignores the partition function, point es- 
timation neither provides information about the estimation 
uncertainty nor allows parameter estimation. 

In the Bayesian framework we try to overcome these 
shortcomings by capturing the full posterior distribution. 
Since it is intractable to manipulate it directly, we consider 
variational approximations of Gaussian form 

Q(x|y) aP(y|x)e^"^-^^"r-s ^_^(^.^^^A-i)^ ^ijj^ 

iQ = A^b, A = CT^H^H + G^r^G, 

r = diag(7), and h = a-^U^y + G^ (3 . (5) 

The implied form for the variational evidence is 

ZqiO) ^ Q{y; 0) = j P(y|x)p/"^-i^"r-s^^ ^^^ 

Our task in variational learning is to adjust the set of varia- 
tional parameters ^ = (/3, 7) so as to improve the fit of the 
approximating Gaussian to the true posterior distribution. 

We will mostly be focusing on log-concave sparsity in- 
ducing potentials tk{-) - i.e., logffe(-) is concave - such as 
the Laplacian. This guarantees that the posterior P(x|y) is 
also log-concave in x, and thus point estimation in Eq. (|4]i is 
a convex optimization problem. Log-concavity also implies 
that P(x|y) is unimodal and justifies approximating it with 
a Gaussian Q(x|y) in Eq. (|5]l. 

2.2. Variational bounding 

Variational bounding ifTOl [TSl l28l 136 1 is applicable to 
sparsity-inducing potentials of super-Gaussian form. The 



family of even super-Gaussian potentials is quite rich and 
superset of the family of mixtures of zero-mean Gaussians; 
it includes the Laplacian and the Student as members ||28ll . 
Super-Gaussian potentials have a useful dual representation 
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-si/{2jk)-hkhk)/2 



with 



sup-s^/7A; - 2\ogtkisk) 



(7) 



(8) 



Variational bounding amounts to replacing the potentials 
tk{sk) in Eq. (|2]i with these bounds and tuning the varia- 
tional parameters 7 (/3 is fixed to zero in this case) so as the 
variational evidence lower bounds as tightly as possible the 
exact evidence Z > Zq. This leads to the variational free 
energy minimization problem (see ll36l for the derivation) 
inf^^o0Q(7). where 



'/'q(7) =log|A| + /i(7) + inf P(x, 7) 



(9) 



^K 



with h{j) = J2k=if^k{lk) and P(x,7) = a '^\\y - 
Hx|p + s^r^^s. The A and b are given in Eq. (|5]i; note 
that A is a function of 7. 

The log-determinant term in Eq. (|9|l is what makes 
Bayesian variational inference more interesting and at the 
same time computationally more demanding than point es- 
timation. Indeed, using Eq. O, we can re-write the ob- 
jective function for MAP estimation (|4]l as 0map(x) — 
inf-y^o ^(7) + P(x, 7), showing that ^map and (J)q only 
differ in the log|A| term, which endows variational infer- 
ence with the ability to capture the effect of the partition 
function. The difficulty lies in the fact that the elements of 
the vector 7 are interleaved in log|A|. Following 028II36I . 
we can decouple the problem by exploiting the concavity of 
log|A| as a function of 7^^ = (tF^j • ■ • i7a'^)- Fenchel du- 
ality then yields the upper bound log|A| < z^7^^— g*(z), 
z ;^ 0. For given 7 the bound becomes tight for z = 
V-y-i log|A| = diag(GA^^G^), which can be identified 
as the vector of marginal variances Zk — Varg (s^ |y) along 
the directions Sk = g^x under the variational posterior 
Q(x|y) with the current guess for the parameters 7. 

This approach naturally suggests a double-loop algo- 
rithm, globally convergent when the potentials tk are log- 
concave 1.28. .36 J . In the outer loop, we compute the vector 
of marginal variances z so as to tighten the upper bound to 
log|A|, given the current value of 7. 

In the inner loop, instead of (f)Q in Eq. (|9]l we minimize 
w.r.t. X and 7 the upper bound given the newly computed z 



0q(x,7;z) =z^7 ^ + hij) + R{x,j) 
a-^\\y-Hxf+Y^I'"+'' 



fc=i 



Ik 



+ hk{jk)] ■ (10) 



We can minimize this expression explicitly w.r.t. 7 by not- 
ing that it is decoupled in the jk and recalling from d?) that 



-2\ogtk{sk) = ini^^^osl/'^k + hk{jk)- This leaves us 
with a minimization problem w.r.t. x alone 



x;z) = inf </)q(x,7;z) = 

K 

= <7-2||y-Hxf -2^1ogtfe((s2+z,)V2) . (11) 

This is just a smoothed version of the MAP point estima- 
tion problem (|4|l, also convex when tk are log-concave, 
which we minimize in the course of the inner loop with 
standard quasi-Newton methods [3J to obtain the varia- 
tional mean x. After completion of the inner loop, we re- 
cover the minimizing values for the variational parameters 

, with which we update the 
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fdlogtkj^) 



dv 



v=si+Zk 



vector of marginal variances z in the subsequent outer loop 
iteration I 



2.3. Mean field and expectation propagation 

Bounding is not the only way to construct variational 
approximations to the intractable posterior distribution 
P(x|y). The mean field (or ensemble learning) approach 
amounts to assuming a simplified parametric form Q for the 
posterior distribution and adjusting the corresponding vari- 
ational parameters ^ so as to minimize the KL-divergence 
DKLiQWP) between Q and P Q]. See Ql for a recent 
application of the mean field approximation to the problem 
of image deconvolution, where it is shown that the mean 
field updates reduce to point estimation and variance com- 
putation primitives, exactly as in the variational bounding 
approximation discussed in detail in Sec. 12.21 

Expectation propagation (EP) is yet another powerful 
variational approximation criterion, in which the variational 
parameters of the approximating distribution Q are adjusted 
so as expectations under Q and the true posterior P(x|y) 
are matched ||24| . There are various iterative sequential 
message passing-like algorithms for optimizing the EP cri- 
terion. Applying EP to large-scale problems in which our 
paper focuses is challenging. We will employ the parallel 
update algorithm of |40|, but our methods are also appli- 
cable to the recently proposed provably convergent double- 
loop algorithm of |35|. Once more, variance estimation in 
the outer loop is the computational bottleneck; see | 



3. Monte- Carlo posterior variance estimation 

As highlighted in Sec. |2] repeatedly computing poste- 
rior Gaussian variances turns out to be a key computational 
routine in all variational approximations of the sparse lin- 
ear model. With reference to Eq. (|5]), our goal is to com- 
pute certain elements of the covariance matrix S = A~^ 
or marginal variances Zk — VarQ(sfc|y) along certain pro- 
jections Sk = g^x under the variational posterior Q(x|y). 



Note that S is a fully dense N xN matrix. Thus for large- 
scale models comprising N w 10^ variables it is impossible 
to compute or store the full S explicitly. 

3.1. Lanczos variance estimation 

So far, the main candidate for variance estimation in the 
context of large-scale variational Bayes has been the Lanc- 
zos iterative method Il36l[37l . As the iteration progresses, 
the Lanczos algorithm builds a monotonically increasing 
estimate for the variances ifTTl . It can reveal in relatively 
few iterations the rough structure and relative magnitude of 
variances, but requires a very large number of iterations to 
accurately approximate their absolute values. Since it scales 
badly with the number of iterations Nl (its complexity is 
0{Nl) in time and 0{Nl) in memory due to a required 
reorthogonalization step), it is only practical to run Lanczos 
for a relatively small number of iterations, yielding gross 
underestimates for the variances. 

In practice, variational bounding has proven relatively 
robust to the Lanczos crude variance estimates Il36l[37l . 
while expectation propagation completely fails ll35l . This 
starkly contrasting qualitative behavior in the two cases can 
be explained as follows: In the presence of Lanczos vari- 
ance underestimation errors, the expression ( fTOt remains an 
upper bound of (|9]l, albeit not tight any more. Moreover, the 
variational optimization problem (fTTT i gracefully degrades 
to the point estimation problem dUi when < Zfe <C Zfc. In 
other words, despite the variance errors the algorithm does 
not collapse, although it effectively ends up solving a mod- 
ified inference problem rather than the one that it was sup- 
posed to solve. In contrast, expectation propagation works 
by moment matching and the gross variance estimation er- 
rors make iterative EP algorithms hopelessly break down. 

3.2. Efficient Monte-Carlo variance estimation with 

Perturb-and-MAP sampling 

We propose estimating variances using a sampling-based 
Monte-Carlo technique, leveraging on the efficient Perturb- 
and-MAP GMRF sampling algorithm of f30l. Although 
[301 has already suggested this possibility, it has not ex- 
plored its effectiveness in the variational Bayesian context. 

The algorithm of fSOl reduces GMRF sampling into a 
GMRF mean estimation problem. In our notation, an exact 
Gaussian sample x ^ A/'(0, A^^), with A = cr^^H^H + 
G^r~^G, can be drawn by solving the linear system 



Ax = 



r-'U^y 



G^(3. 



(12) 



The local perturbations y ~ A/^(0,ct^I) and /3 ^ 
7\A(0, r~^) are trivial to sample since T is diagonal. We ef- 
ficiently solve the linear system (fT2l l using preconditioned 
conjugate gradients (PCG) 111], employing filtering rou- 
tines for fast evaluation of matrix-vector products Ax, thus 



avoiding the costly Cholesky factorization step typically as- 
sociated with Gaussian simulation. In contrast to Lanczos, 
the memory footprint of PCG is small as only 4 length- 
N vectors need to be stored, while multiple samples can 
be trivially drawn in parallel (using, e.g., parf or in Mat- 
lab). Also note that, unlike conjugate gradients, employing 
preconditioning within Lanczos variance estimation is dif- 
ficult l!!34il and seldom used in practice. 

Having drawn Ng Gaussian samples as described, we 
employ the standard sample-based covariance estimators 
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4 = 1 



'fe,i J 



(13) 



with Sfe.i ~ &k^i- The variance estimates marginally fol- 
low scaled chi-square distributions with Ng degrees of free- 
dom Zk ^ j^x^i^s)- This implies that E {z^} = Zk, i.e., 
this estimator is unbiased, unlike the Lanczos one. Its rela- 
tive error is r = A{zk)/zk = y/Vai{zk)/zk = y/'2/Ns, in- 
dependent from the problem size N. The error drops quite 
slowly with the number of samples (Ns = 2/r^ samples 
are required to reach a desired relative error r), but vari- 
ance estimates sufficiently accurate for even the more sen- 
sitive expectation propagation algorithm to work reliably 
can be obtained after about 20 samples (which translates 
to r « 32%). One can show that Zk < 7^7^ ll36] . a con- 
sequence of the fact that measurements always reduce the 
uncertainty in Gaussian models. To enforce this important 
structural constraint, we use in place of (fTsT l the clipped esti- 
mator Zk = min(2fc, 7j7^) which behaves considerably bet- 
ter in practice while still being (asymptotically) unbiased. 

To illustrate the efficiency of the proposed Monte-Carlo 
variance estimator in the context of variational Bayesian in- 
ference, we compare in Fig. [T]the marginal variances ob- 
tained by our sample-based estimator with that of Lanczos. 
The system matrix A for this particular example is the one 
of the last iteration of the double-loop variational bound- 
ing algorithm of Sec. 12.21 applied to a small-scale 48 x 73 
deblurring problem for which it is feasible to compute the 
exact marginal variances Zk- We use the clipped version 
Zk of our estimator with Ns — 20 samples, each drawn by 
solving the linear system ( fT2] i with 20 PCG iterations as de- 
tailed in Sec. |4] Lanczos was run for Nl — 300 iterations, 
so as the runtime for the two algorithms to be the same. We 
see that the proposed sample-based variance estimator per- 
forms markedly better than Lanczos, which grossly under- 
estimates the true marginal variances. Note that for large- 
scale problems the performance gap will be even more pro- 
nounced: as we showed earlier, the relative estimation accu- 
racy r of the sample-based estimator is independent of the 
latent space dimensionality N, while the relative accuracy 
of Lanczos further deteriorates for large N lf36l Fig. 6]. 
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Figure L Scatter-plot of exact Zk vs. estimated Zk marginal vari- 
ances for a small-scale deblurring problem. We compare the the 
proposed sample-based Monte-Carlo estimator with Lanczos. 



3.3. Monte-Carlo free energy estimation 

To monitor convergence of the free energy ^ and for 
debugging purposes, it is desirable to estimate log| A| dur- 
ing the course of the algorithm. Note that this step is not a 
requisite for the variational algorithm to yield estimates for 
X or estimate the model parameters 6. 

By coercing information from the samples ic ~ 
Af{0,A^^) drawn for variance estimation, we can reli- 
ably estimate log|A| at no extra cost, provided that we 
can analytically compute log|P| for some matrix P that 
approximates A well, typically the preconditioner em- 
ployed by PCG for solving ( fT2] l. To see this, note that 
£;{cxp(0.5i^(A-P)S;)} = |A|/|P|, which suggests 
the Monte-Carlo estimator 

log|A|«log|P|-log7V3+logf^0.5xf(A-P)xJ . 

(14) 
A special case of this with P = I has been proposed before 
Q, but for the large-scale problems we consider here using 
a good reference P sa A is crucial for the estimator ( fT4] i to 
exhibit low variance and thus be useful in practice. 

4. Applications to image deconvolution 

Our main motivation for this work is solving inverse 
problems in image analysis and low-level vision such as im- 
age deblurring, inpainting, and tomographic reconstruction. 
These give rise to large-scale inference problems involving 
millions of variables. We report experimental results on im- 
age deconvolution. Our software builds on the glm-ie 
Matlab toolbox ||25]| designed for variational inference un- 
der the variational bounding |36| and expectation propaga- 
tion f40| criteria, which we have extended to include imple- 
mentations of the proposed algorithms; our extensions will 
be integrated in future releases of glm-ie. 



In image deblurring 01211131 . our goal is to recover the 
sharp image x from its blurred version y. We assume a 
spatially homogeneous degradation, typically due to cam- 
era or subject motion, captured by the measurement process 
y = Hx == k * X. In the non-blind variant of the prob- 
lem, the convolution blur kernel k is considered known (the 
problem is classically known as image restoration), while 
in the more challenging blind variant our goal is to recover 
both the sharp image and the unknown blurring kernel. 

Blind image deconvolution In the blind deconvolution 
case, the blurring kernel is considered as parameter, = k, 
which we recover by maximum (penalized) likelihood. It 
is crucial to determine k by first integrating out the latent 
variables x and then maximizing the marginal likelihood 
argmaxjj. P(y;k), instead of maximizing the joint likeli- 
hood argmaxj. (maxx i^(x, y; k)) |9, 17|. Under the varia- 
tional approximation, we use Q(y; k) from (|6]l in place of 
P(y; k). Following I110II181 , we carry out the optimization 
iteratively using expectation-maximization (EM). 

In the E-step, given the current estimate k* for the 
blurring kernel, we perform variational Bayesian inference 
as described in Sec. |2l In the M-step of the i-th iter- 
ation, we maximize w.rt. k the expected complete log- 
likelihood _Ekt {log(3(x, y;k)}, with expectations taken 
w.rt. (3(x|y;k*). The updated kernel k*+^ is obtained by 
minimizing w.r.t. k (see 1181 for the derivation) 

Pu^jilly-Hxll 

= i tr ((H^H)(A-i + ii^)) - y^Hi + (const) 



-k^Rxxk - r^yk + (const) , 



(15) 



which is a quadratic program in k; see 1 18| for the formu- 
las for Txy and Rxx- The entries in rxy accumulate cross- 
correlations between x and y; we use the variational mean 
X of ( fTTT i for computing them. The entries in Rxx capture 
second-order information for x under (5(x|y; k*); we es- 
timate them efficiently by drawing a small number of sam- 
ples (1 or 2 suffice) from A/^(0, A^^), exactly as in Sec. 13.21 
Note that 1 18] estimates Rxx by making the simplifying as- 
sumption that A is diagonal, which could potentially lead to 
a poor approximation. We add to ( fTSl l an extra Li penalty 
term Ai ||k||Li so as to favor sparse kernels. 

It is important to note that while the M-step update for 
k in (fTsT i is a convex optimization problem, the overall log- 
likelihood objective — log Q(y; k) is not convex in k. This 
means that the EM algorithm can get stuck to local min- 
ima. Various techniques have been developed to mitigate 
this fundamental problem, such as coarse-to-fine kernel re- 
covery, gradient domain processing, or regularization of the 



result after each kernel update with (fTsT i - see ||9][l2l- We 
have not yet incorporated these heuristics into our blind de- 
convolution implementation, and thus our software may still 
give unsatisfactory results when the spatial support of the 
unknown blurring kernel is large. 

Efficient circulant preconditioning Our sample-based 
variance estimator described in Sec. |3.2| requires repeatedly 
drawing samples x. For each of the samples we solve by 
PCG a linear system of the form Ax = c, where c is the 
randomly perturbed right hand side in Eq. ( fT2l l. 

The system matrix A = ct^^H-^H + G"^r^^G arising 
in image deblurring is typically poorly conditioned, slow- 
ing the convergence of plain conjugate gradients. The key 
to designing an effective preconditioner for A is to note that 
A would be a stationary operator if F = 7I, i.e., the varia- 
tional parameters 7fc were homogeneous. Following [16], 
we select as preconditioner the stationary approximation 
of the system matrix, P = ct^^H^H + 7"iG'^G, with 
7"-'^ = (1/^) X]fc=i ^aT^- ^^^ ^^^ prove that P is the 
stationary matrix nearest to A in the Frobenius norm, i.e. 
P = argmiuxgcll^ ~ ■'^ll' where C is the set of station- 
ary (block-circulant with circulant blocks) matrices |[T6ll . 
Thanks to its stationarity, P is diagonalized in the Fourier 
domain; by employing the 2-D DFT, we can compute very 
efficiently expressions of the form P^^x required by PCG 
IIT2I . Moreover, log|P| is also readily computable in the 
Fourier domain, allowing us to use the efficient free energy 
estimator ( fT4] i for monitoring convergence. Note that the 
applicability of this preconditioner extends beyond our vari- 
ance estimation setup; e.g. it could be employed in conjunc- 
tion with the MCMC-based deblurring algorithm of [i331 . 

Circulant preconditioning with P dramatically acceler- 
ates convergence of conjugate gradients. We plot in Fig. |2] 
the residual in the course of conjugate gradient iteration for 
a typical system matrix A arising in deblurring a 190 x 289 
image under the variational bounding approximation. With 
circulant preconditioning (PCG) we attain within only 10 
iterations the same level of accuracy that is reached af- 
ter 100 iterations of unpreconditioned conjugate gradients 
(CG). This substantial improvement in the convergence rate 
more than compensates the roughly 60% time overhead per 
iteration of PCG relative to CG (respectively, 80 vs. 50 msec 
per iteration on this problem). We are not aware of any work 
that similarly exploits the benefits of preconditioning in the 
context of Lanczos variance estimation. 
Image deblurring results We have carried out prelimi- 
nary image deblurring experiments using the dataset of 1 17l 
which contains images degraded by real blur due to camera 
motion, as well as their sharp versions shot with the cam- 
era still. We assume a total-variation prior, which implies 
simple first-order finite difference filters as rows of G and 
Laplacian sparsity inducing potentials ifc(sfc) = Q-Tk\sk\^ 




20 40 60 80 100 120 

Figure 2. Conjugate gradients residual norm as function of itera- 
tion count; No (CG) vs. circulant (PCG) preconditioner. 



We fix Tfe = 15 which roughly matches the image deriva- 
tive scale for typical images with values between and 1 . 
We set the noise variance to cr^ — 10^^. 

We employ the double-loop algorithms described in 
Sec. |2]for both the variational bounding (VB) and expec- 
tation propagation (EP). We use 20 samples for variance 
estimation, and allow 20 PCG iterations for solving each of 
the linear systems ( fT2b . We show the debluiTed images from 
both the VB and EP algorithms in Fig. [3] for both the non- 
blind and blind scenaria. Note that EP completely breaks 
down if we use the Lanczos variance estimator, while it re- 
liably works under our sample-based variance estimator. 

5. Discussion 

We have shown that marginal variances required by vari- 
ational Bayesian algorithms can be effectively estimated 
using random sampling. This allows applying variational 
Bayesian inference to large-scale problems, essentially at 
the same cost as point estimation. The proposed variance 
estimator can be thought as a stochastic sub-routine in the 
otherwise deterministic variational framework. 

Interestingly, efficient Perturb-and-MAP random sam- 
pUng turns out to be a key component in both the proposed 
approach to variational inference and recent MCMC tech- 
niques Il29ll30l[32i33l . Systematically comparing these two 
alternative Bayesian inference alternatives in large-scale ap- 
plications arises as an interesting topic for future work. 
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